9/20/2018

University of Arkansas Statistics Seminar

Introduction

  • Climate change is well understood globally.
  • Climate change is less well understood locally.
  • Need for spatially explicit reconstructions of climate variables.
  • Problem: data sources are messy and noisy.

Introduction

Predicting the future by learning from the past

Learning about the past

Climate proxy data

  • Many ecological and physical processes respond to climate over different time scales.
    • Tree rings, corals, forest landscapes, ice rings, lake levels, etc.


  • These processes are called climate proxies.
    • They are proxy measurements for unobserved climate.
    • Noisy and messy.
    • Respond to a wide variety of non-climatic signals.

Pollen Data

Model Framework

  • Bayesian hierarchical model.

\(\begin{align*} \color{cyan}{[\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}]} & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}]} \color{blue}{[\mathbf{Z} | \boldsymbol{\theta}_P]} \color{orange}{[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P]} \end{align*}\)

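  • Posterior (cyan): \([\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}]\).

  • Data model (red): \([\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}]\).

  • Process model (blue): \([\mathbf{Z} | \boldsymbol{\theta}_P]\).

  • Prior model (orange): \([\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P]\).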

Data Model

\(\begin{align*} [\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto \color{red}{[\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}]} [\mathbf{Z} | \boldsymbol{\theta}_P] [\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*}\)

Data Model

  • Describes how the data are collected and observed.
  • Researchers take sediment samples from a lake.
  • They take 1 cm\(^3\) cubes along the length of the sediment core.
  • In each cube, the researcher counts the first \(N\) pollen grains and identifies each grain to species.
  • The raw data are counts of each species.

Data Model

For location \(\mathbf{s}_i\) and time \(t\),

\(\begin{align*} \mathbf{y} \left( \mathbf{s}_i, t \right) & = \left( y_{1} \left( \mathbf{s}_i, t \right), \ldots, y_{d} \left( \mathbf{s}_i, t \right) \right)' \end{align*}\)

is an observation of a \(d\)-dimensional compositional count.

  • \(y_{j} \left( \mathbf{s}_i, t \right)\) is the count of species \(j\) in the sample at location \(\mathbf{s}_i\) and time \(t\).
  • Compositional count data.
    • Total count is not informative of the absolute composition.
    • Informative of the relative proportions \(p_{j} \left( \mathbf{s}_i, t \right)\) only.
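
A minimal Python sketch of why the total count carries no information about composition: two hypothetical samples with different totals but the same relative abundances give identical proportions. The counts and the helper name `proportions` are illustrative assumptions, not values from the pollen data.

```python
def proportions(counts):
    """Convert a vector of species counts into relative proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive total")
    return [c / total for c in counts]


# Hypothetical counts for d = 4 taxa in two samples with different totals.
sample_small = [30, 15, 10, 5]      # N = 60
sample_large = [150, 75, 50, 25]    # N = 300, same composition

p_small = proportions(sample_small)
p_large = proportions(sample_large)
```

Both samples map to the same point on the simplex, so only the proportions \(p_{j} \left( \mathbf{s}_i, t \right)\) are identifiable from the counts.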

Data Model

  • The compositional count vector \(\mathbf{y} \left( \mathbf{s}_i, t \right)\) is a function of the latent proportions \(\mathbf{p}\left( \mathbf{s}_i, t \right)\).


\(\begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) | \mathbf{p}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Multinomial} \left( N\left( \mathbf{s}_i, t \right), \mathbf{p}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)


  • \(N\left( \mathbf{s}_i, t \right) = \sum_{j=1}^d y_{j}\left( \mathbf{s}_i, t \right)\) is the total count observed (fixed and known) for observation at location \(\mathbf{s}_i\) and time \(t\).


Overdispersion

  • The pollen data are highly variable and overdispersed.


\(\begin{align*} \mathbf{p}\left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Dirichlet} \left( \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)


  • Marginalize out \(\mathbf{p} \left( \mathbf{s}_i, t \right)\) to get the Dirichlet-multinomial distribution.


\(\begin{align*} \mathbf{y}\left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) & \sim \operatorname{Dirichlet-Multinomial} \left( N\left( \mathbf{s}_i, t \right), \boldsymbol{\alpha}\left( \mathbf{s}_i, t \right) \right) \end{align*}\)
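
The marginal Dirichlet-multinomial probability mass function can be written with log-gamma functions. The sketch below assumes a small hypothetical example (\(N = 5\) grains, \(d = 3\) taxa, and made-up \(\boldsymbol{\alpha}\)). It enumerates the full support to check that the probabilities sum to one, and compares the variance of one count against the multinomial variance with the same mean proportion.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(y, alpha):
    """Log pmf of counts y given concentration parameters alpha."""
    n = sum(y)
    a0 = sum(alpha)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for yj, aj in zip(y, alpha):
        out += lgamma(yj + aj) - lgamma(yj + 1) - lgamma(aj)
    return out


def compositions(n, d):
    """Yield every length-d tuple of non-negative integers summing to n."""
    if d == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, d - 1):
            yield (first,) + rest


# Hypothetical values chosen only for illustration.
n_grains = 5
alpha = [1.0, 2.0, 3.0]
alpha0 = sum(alpha)

support = list(compositions(n_grains, len(alpha)))
probs = [exp(dirichlet_multinomial_logpmf(y, alpha)) for y in support]
total_prob = sum(probs)

# Mean and variance of the first taxon's count under the Dirichlet-multinomial.
mean_y1 = sum(p * y[0] for p, y in zip(probs, support))
var_y1 = sum(p * (y[0] - mean_y1) ** 2 for p, y in zip(probs, support))

# Multinomial variance with the same expected proportion alpha_1 / alpha_0.
p1 = alpha[0] / alpha0
multinomial_var = n_grains * p1 * (1.0 - p1)
inflation = var_y1 / multinomial_var
```

Under these assumptions the variance ratio equals the Dirichlet-multinomial inflation factor \((N + \alpha_0) / (1 + \alpha_0)\), with \(\alpha_0 = \sum_j \alpha_j\), which is how the model absorbs overdispersion.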


Process Model

\(\begin{align*} [\mathbf{Z}, \boldsymbol{\theta}_D, \boldsymbol{\theta}_P | \mathbf{y}] & \propto [\mathbf{y} | \boldsymbol{\theta}_D, \mathbf{Z}] \color{blue}{[\mathbf{Z} | \boldsymbol{\theta}_P]}[\boldsymbol{\theta}_D] [\boldsymbol{\theta}_P] \end{align*}\)

Process Model

  • We model the Dirichlet-multinomial count data using the log link function:


  • \(\begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) \right) & = \mathbf{z}\left( \mathbf{s}_i, t \right) \boldsymbol{\beta}. \end{align*}\)


  • \(\mathbf{z}\left( \mathbf{s}_i, t \right)\) is a \(q\)-dimensional vector of covariates.


  • \(\boldsymbol{\beta}\) is a \(q \times d\) matrix of regression coefficients.
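
As a small illustration of the log link, the sketch below computes \(\boldsymbol{\alpha} = \exp \left( \mathbf{z} \boldsymbol{\beta} \right)\) for one observation, treating \(\mathbf{z}\) as a row vector of length \(q\). The covariate values (an intercept and one standardized climate variable) and the coefficient matrix are hypothetical.

```python
from math import exp


def linear_predictor(z, beta):
    """Multiply a length-q row vector z by a q x d matrix beta."""
    q = len(z)
    d = len(beta[0])
    return [sum(z[k] * beta[k][j] for k in range(q)) for j in range(d)]


def alpha_from_covariates(z, beta):
    """Apply the inverse log link: alpha_j = exp((z beta)_j) > 0."""
    return [exp(v) for v in linear_predictor(z, beta)]


# Hypothetical inputs: q = 2 covariates, d = 3 taxa.
z = [1.0, 0.5]
beta = [
    [0.0, 1.0, -1.0],   # intercept row
    [1.0, 0.0, 2.0],    # climate-covariate row
]

eta = linear_predictor(z, beta)            # [0.5, 1.0, 0.0]
alpha = alpha_from_covariates(z, beta)

# Expected proportions implied by alpha: E[p_j] = alpha_j / sum(alpha).
expected_p = [a / sum(alpha) for a in alpha]
```

The exponential guarantees positive concentration parameters for any real-valued covariates and coefficients.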

Process Model

  • The covariates \(\mathbf{z} \left( \mathbf{s}_i, t \right)\) are observed only at \(t = 1\).


  • Calibration:
    • Estimate \(\boldsymbol{\beta}\) using:
      • \(\left( \mathbf{y} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{y} \left( \mathbf{s}_n, 1 \right) \right)'\) and
      • \(\left( \mathbf{z} \left( \mathbf{s}_1, 1 \right), \ldots, \mathbf{z} \left( \mathbf{s}_n, 1 \right) \right)'\).


  • Reconstruction:
    • Use the estimated \(\boldsymbol{\beta}\) and fossil pollen \(\mathbf{y} \left( \mathbf{s}, t \right)\) to predict the unobserved \(\mathbf{z}\left( \mathbf{s}, t \right)\).
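
The reconstruction step can be sketched as inverse prediction: hold \(\boldsymbol{\beta}\) fixed and search for the covariate value that best explains a fossil count vector. The sketch below is a deliberately simplified stand-in, assuming a single scalar climate covariate, made-up coefficients, and a maximum-likelihood grid search rather than the full Bayesian posterior over \(\mathbf{z}\) used in the talk.

```python
from math import exp, lgamma


def dm_logpmf(y, alpha):
    """Dirichlet-multinomial log pmf."""
    n = sum(y)
    a0 = sum(alpha)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for yj, aj in zip(y, alpha):
        out += lgamma(yj + aj) - lgamma(yj + 1) - lgamma(aj)
    return out


def alpha_from_z(z, beta):
    """Log link with an intercept row beta[0] and a slope row beta[1]."""
    return [exp(b0 + b1 * z) for b0, b1 in zip(beta[0], beta[1])]


def reconstruct_z(y, beta, grid):
    """Return the grid value of z with the highest log-likelihood for y."""
    return max(grid, key=lambda z: dm_logpmf(y, alpha_from_z(z, beta)))


# Hypothetical "calibrated" coefficients for d = 3 taxa:
# taxon 1 increases with z, taxon 2 is indifferent, taxon 3 decreases.
beta = [
    [2.0, 2.0, 2.0],
    [1.5, 0.0, -1.5],
]
grid = [i / 10 for i in range(-30, 31)]

# Hypothetical fossil counts dominated by taxon 1 and by taxon 3.
z_warm = reconstruct_z([80, 15, 5], beta, grid)
z_cold = reconstruct_z([5, 15, 80], beta, grid)
```

A sample dominated by the taxon that increases with the covariate yields a positive estimate, and the mirrored sample yields a negative one. The full model replaces this point estimate with a posterior distribution.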


Calibration Model

Non-linear Process Model

  • Vegetation response to climate is non-linear.


  • \(\begin{align*} \operatorname{log} \left( \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) \right) & = \mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right) \boldsymbol{\beta} \end{align*}\)


  • \(\mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right)\) is a basis expansion of the covariates \(\mathbf{z}\left( \mathbf{s}_i, t \right)\).
    • Use B-splines or Gaussian Processes as a basis.


  • Note that for \(t \neq 1\), the \(\mathbf{z} \left( \mathbf{s}_i, t \right)\) are unobserved.
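
A B-spline basis expansion for a single covariate can be sketched with the Cox-de Boor recursion. The knot vector, the degree, and the function name `bspline_basis` are illustrative assumptions; the slides do not specify them.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at scalar x."""
    n_intervals = len(knots) - 1
    # Degree 0: indicators of the half-open knot intervals [t_i, t_{i+1}).
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(n_intervals)]
    # Close the right end so that x equal to the last knot is covered.
    if x == knots[-1]:
        for i in range(n_intervals - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Cox-de Boor recursion, skipping terms with zero-width support.
    for k in range(1, degree + 1):
        new = []
        for i in range(len(knots) - k - 1):
            left = right = 0.0
            d1 = knots[i + k] - knots[i]
            if d1 > 0:
                left = (x - knots[i]) / d1 * basis[i]
            d2 = knots[i + k + 1] - knots[i + 1]
            if d2 > 0:
                right = (knots[i + k + 1] - x) / d2 * basis[i + 1]
            new.append(left + right)
        basis = new
    return basis


# Hypothetical setup: cubic basis on [0, 1] with one interior knot.
degree = 3
knots = [0.0] * 4 + [0.5] + [1.0] * 4
z_values = [0.0, 0.25, 0.5, 0.9, 1.0]

# Each row is B(z) for one covariate value; there are 5 basis functions.
design = [bspline_basis(z, knots, degree) for z in z_values]
```

Each row of `design` would play the role of \(\mathbf{B} \left( \mathbf{z}\left( \mathbf{s}_i, t \right) \right)\) in the log link, with \(\boldsymbol{\beta}\) then having one row per basis function.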

Calibration Model

Dynamic Model

Process Model

  • We are interested in estimating the latent process \(\mathbf{z} \left( \mathbf{s}, t \right)\).


  • The model can accommodate:
    1. continuous vs. discrete space (geostatistical vs. CAR models)
    2. continuous vs. discrete time (stochastic process vs. AR models)


  • For now, we focus on continuous space and discrete time.

Dynamic Model

  • For \(\mathbf{z} \left(t \right) = \left( \mathbf{z} \left(\mathbf{s}_1, t \right)', \ldots, \mathbf{z} \left(\mathbf{s}_n, t \right)' \right)\), we assume:

\(\begin{align*} \mathbf{z} \left(t \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} & = \mathbf{M}\left(t\right) \left( \mathbf{z} \left(t-1 \right) - \mathbf{X} \left( t \right) \boldsymbol{\gamma} \right) + \boldsymbol{\eta} \left(t \right) \end{align*}\)


  • \(\mathbf{M}(t) = \rho \mathbf{I}_n\) is a propagator matrix.
  • \(\mathbf{X} \left(t \right) \boldsymbol{\gamma}\) are the fixed effects from covariates like latitude, elevation, etc.
  • \(\boldsymbol{\eta} \left( t \right) \sim \operatorname{N} \left( \mathbf{0}, \tau^2 \mathbf{R}\left( \boldsymbol{\phi} \right) \right)\).
  • \(\tau^2\) is the spatial process variance.
  • \(\mathbf{R} \left( \boldsymbol{\phi} \right)\) is a Matérn spatial covariance matrix with parameters \(\boldsymbol{\phi}\).
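
The dynamic model can be simulated forward in time to see how \(\rho\), \(\tau^2\), and \(\boldsymbol{\phi}\) shape the latent field. The sketch below makes several simplifying assumptions that are not in the slides: one climate variable per site, a time-invariant mean vector standing in for \(\mathbf{X} \boldsymbol{\gamma}\), an exponential correlation function (the Matérn family with smoothness 1/2), and an initial state drawn as the mean plus one innovation. All numeric values are hypothetical.

```python
import math
import random


def exponential_correlation(coords, phi):
    """Matern correlation with smoothness 1/2: R_ij = exp(-distance_ij / phi)."""
    return [[math.exp(-math.dist(a, b) / phi) for b in coords] for a in coords]


def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def simulate_dynamic(coords, mean, rho, tau2, phi, n_times, rng):
    """Simulate z(t) - mean = rho * (z(t-1) - mean) + eta(t)."""
    n = len(coords)
    low = cholesky(exponential_correlation(coords, phi))
    scale = math.sqrt(tau2)

    def draw_eta():
        # eta = sqrt(tau2) * L e with e ~ N(0, I), so Cov(eta) = tau2 * R.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        return [scale * sum(low[i][k] * e[k] for k in range(i + 1))
                for i in range(n)]

    path = [[m + v for m, v in zip(mean, draw_eta())]]  # assumed initial state
    for _ in range(1, n_times):
        eta = draw_eta()
        prev = path[-1]
        path.append([mean[i] + rho * (prev[i] - mean[i]) + eta[i]
                     for i in range(n)])
    return path


# Hypothetical sites, mean climate, and parameters.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
mean = [10.0, 10.5, 9.5, 8.0]
z_path = simulate_dynamic(coords, mean, rho=0.9, tau2=0.25, phi=1.5,
                          n_times=50, rng=random.Random(2018))
```

Values of \(\rho\) near one give slowly varying anomalies around the mean, while larger range parameters make neighboring sites move together.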

Time Uncertainty

  • Each fossil pollen observation includes estimates of time uncertainty.
    • The time of the observation is uncertain.
    • Weight the likelihoods according to age-depth model.
    • Posterior distribution of ages.
  • For each fossil pollen observation, an age-depth model gives a posterior distribution over dates.
    • Define \(\omega \left(\mathbf{s}_i, t \right)\) as \(P(\text{age} \in (t-1, t))\). Then
    • \([\mathbf{y} \left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right) ] = \prod_{t=1}^T [\mathbf{y} \left( \mathbf{s}_i, t \right) | \boldsymbol{\alpha} \left( \mathbf{s}_i, t \right)]^{\omega \left(\mathbf{s}_i, t \right)}\)
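
On the log scale, the weighted likelihood above becomes a weighted sum of Dirichlet-multinomial log-likelihoods, one per time bin. The sketch below assumes hypothetical age weights and concentration parameters for a single observation; the function names are illustrative.

```python
from math import lgamma


def dm_logpmf(y, alpha):
    """Dirichlet-multinomial log pmf."""
    n = sum(y)
    a0 = sum(alpha)
    out = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for yj, aj in zip(y, alpha):
        out += lgamma(yj + aj) - lgamma(yj + 1) - lgamma(aj)
    return out


def age_weighted_loglik(y, alphas_by_time, weights):
    """Sum over time bins of omega(s_i, t) * log [y | alpha(s_i, t)]."""
    if len(alphas_by_time) != len(weights):
        raise ValueError("need one weight per time bin")
    return sum(w * dm_logpmf(y, a)
               for a, w in zip(alphas_by_time, weights) if w > 0)


# Hypothetical observation with T = 3 candidate time bins.
y = [12, 5, 3]
alphas_by_time = [
    [4.0, 2.0, 1.0],
    [3.0, 3.0, 1.0],
    [1.0, 2.0, 4.0],
]
omega = [0.2, 0.7, 0.1]   # age-depth posterior mass in each bin

loglik = age_weighted_loglik(y, alphas_by_time, omega)
certain = age_weighted_loglik(y, alphas_by_time, [0.0, 1.0, 0.0])
```

When the age is known exactly, all weight falls on one bin and the expression reduces to the ordinary Dirichlet-multinomial log-likelihood at that time.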

Simulation Study

Simulated Data

Simulated Reconstruction

Simulated Reconstruction Temporal Trend

Reconstruction

Reconstruction in time

Reconstruction